50 research outputs found
Language models in molecular discovery
The success of language models, especially transformer-based architectures,
has trickled into other domains giving rise to "scientific language models"
that operate on small molecules, proteins or polymers. In chemistry, language
models contribute to accelerating the molecule discovery cycle as evidenced by
promising recent findings in early-stage drug discovery. Here, we review the
role of language models in molecular discovery, underlining their strength in
de novo drug design, property prediction and reaction chemistry. We highlight
valuable open-source software assets thus lowering the entry barrier to the
field of scientific language modeling. Last, we sketch a vision for future
molecular design that combines a chatbot interface with access to computational
chemistry tools. Our contribution serves as a valuable resource for
researchers, chemists, and AI enthusiasts interested in understanding how
language models can and will be used to accelerate chemical discovery. Comment: Under review
PaccMann: Designing anticancer drugs from transcriptomic data via reinforcement learning
With the advent of deep generative models in computational chemistry, in
silico anticancer drug design has undergone an unprecedented transformation.
While state-of-the-art deep learning approaches have shown potential in
generating compounds with desired chemical properties, they disregard the
genetic profile and properties of the target disease. Here, we introduce the
first generative model capable of tailoring anticancer compounds for a specific
biomolecular profile. Using a reinforcement learning (RL) framework, the transcriptomic profiles of
cancer cells are used as a context for the generation of candidate molecules.
Our molecule generator combines two separately pretrained variational
autoencoders (VAEs): the first VAE encodes transcriptomic profiles into a
smooth latent space, which in turn is used to condition a second VAE to
generate novel molecular structures for the given transcriptomic profile. The
generative process is optimized through PaccMann, a previously developed drug
sensitivity prediction model, to obtain effective anticancer compounds for the
given context (i.e., transcriptomic profile). We demonstrate how the molecule
generation can be biased towards compounds with high predicted inhibitory
effect against individual cell lines or specific cancer sites. We verify our
approach by investigating candidate drugs generated against specific cancer
types and find the highest structural similarity to existing compounds with
known efficacy against these cancer types. We envision that our approach will
transform in silico anticancer drug design by leveraging the biomolecular
characteristics of the disease in order to increase success rates in lead
compound discovery. Comment: 18 pages total (12 pages main text, 4 pages references, 11 pages
appendix), 8 figures
Domain-agnostic and Multi-level Evaluation of Generative Models
While the capabilities of generative models have improved considerably in different
domains (images, text, graphs, molecules, etc.), their evaluation metrics
largely remain based on simplified quantities or manual inspection with limited
practicality. To this end, we propose a framework for Multi-level Performance
Evaluation of Generative mOdels (MPEGO), which could be employed across
different domains. MPEGO aims to quantify generation performance
hierarchically, starting from a sub-feature-based low-level evaluation to a
global features-based high-level evaluation. MPEGO offers great customizability
as the employed features are entirely user-driven and can thus be highly
domain/problem-specific while being arbitrarily complex (e.g., outcomes of
experimental procedures). We validate MPEGO using multiple generative models
across several datasets from the material discovery domain. An ablation study
is conducted to study the plausibility of intermediate steps in MPEGO. Results
demonstrate that MPEGO provides a flexible, user-driven, and multi-level
evaluation framework, with practical insights on the generation quality. The
framework, source code, and experiments will be available at
https://github.com/GT4SD/mpego
Accelerating Detection of Lung Pathologies with Explainable Ultrasound Image Analysis
Care during the COVID-19 pandemic hinges upon the existence of fast, safe, and highly sensitive diagnostic tools. Considering the significant practical advantages of lung ultrasound (LUS) over other imaging techniques, but also the difficulties doctors face in pattern recognition, we aim to leverage machine learning toward guiding diagnosis from LUS. We release the largest publicly available LUS dataset for COVID-19 consisting of 202 videos from four classes (COVID-19, bacterial pneumonia, non-COVID-19 viral pneumonia and healthy controls). On this dataset, we perform an in-depth study of the value of deep learning methods for the differential diagnosis of lung pathologies. We propose a frame-based model that correctly distinguishes COVID-19 LUS videos from healthy and bacterial pneumonia data with a sensitivity of 0.90±0.08 and a specificity of 0.96±0.04. To investigate the utility of the proposed method, we employ interpretability methods for the spatio-temporal localization of pulmonary biomarkers, which are deemed useful for human-in-the-loop scenarios in a blinded study with medical experts. Aiming for robustness, we perform uncertainty estimation and demonstrate that the model recognizes low-confidence situations, which also improves performance. Lastly, we validate our model on an independent test dataset and report promising performance (sensitivity 0.806, specificity 0.962). The provided dataset facilitates the validation of related methodology in the community and the proposed framework might aid the development of a fast, accessible screening method for pulmonary diseases. Dataset and all code are publicly available at: https://github.com/BorgwardtLab/covid19_ultrasound
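For readers less familiar with the two metrics reported above, the following is a minimal sketch of how sensitivity and specificity are computed from binary predictions. It is not the authors' evaluation code: the function name `sensitivity_specificity` and the toy labels are hypothetical, and the multi-class setting of the paper is reduced here to a single positive class (e.g., COVID-19 versus everything else).

```python
from typing import Sequence, Tuple


def sensitivity_specificity(
    y_true: Sequence[int], y_pred: Sequence[int]
) -> Tuple[float, float]:
    """Return (sensitivity, specificity) for binary labels, where 1 = positive."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")

    # Count the four cells of the confusion matrix.
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)

    if tp + fn == 0 or tn + fp == 0:
        raise ValueError("need at least one positive and one negative example")

    sensitivity = tp / (tp + fn)  # fraction of true positives that are detected
    specificity = tn / (tn + fp)  # fraction of true negatives that are rejected
    return sensitivity, specificity


# Toy labels (hypothetical, not taken from the LUS dataset).
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 0, 0, 0, 0, 1]
sens, spec = sensitivity_specificity(y_true, y_pred)
print(f"sensitivity={sens:.2f} specificity={spec:.2f}")
```

In a cross-validated setting such as the one described, these values would typically be computed per fold and then summarized as mean ± standard deviation.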
Accelerating Molecular Discovery with Generative Language Models: A journey through the chemical space
The discovery of new molecules and materials with desired properties is pivotal to our success in combatting global challenges such as the climate crisis or emerging diseases. However, navigating the discrete and practically infinite chemical search space while having to respect a cascade of multiproperty objectives is extremely challenging. In the past few decades, the chemical industry has faced not only a decline in productivity, but also ever-rising costs for the research and development of novel materials and molecules. Recently, molecular generative models coupled with virtual screening methods have shown promising results in efficient and systematic chemical space exploration. The hopes are high that such methods can accelerate the molecular discovery process, in particular when coupled with chemical synthesis planning tools and robotic hardware in automated laboratories. However, most generative models are optimized toward simplistic, chemo-centric objectives, disregard system-level information about the target environment of the molecule and thus cannot be applied to generate molecules conditionally for a wide range of objectives. This thesis is about developing conditional molecular generative models that can be queried with a semantic context and flexibly generate molecules for desired conditions without the need of specific optimization. Moreover, this thesis aims to improve the "entanglement" of de novo design and property prediction by developing molecular generative models that possess inductive biases about continuous properties and also excel at predicting such properties. This is achieved by exploiting analogies between natural language and organic chemistry. As a prerequisite for generative modeling, the first part of this thesis is devoted to building predictive models for molecular properties.
The first chapter presents a simple, yet robust and interpretable chemical language model that heavily relies on data augmentation and is shown to exhibit strong performance across a wide range of properties such as toxicity. The next chapter develops proteochemometric language models for protein-ligand binding affinity prediction and demonstrates that by discarding more than 95% of the residues from the protein sequence, the performance of binding affinity prediction for human protein kinases significantly improves. The second part of this thesis focuses on the main goal of developing generative language models for conditional molecular design. Leveraging the property predictors in a reinforcement-learning optimization scheme yields a generative model that can be conditioned on a biomolecular context vector (e.g., a gene expression signature of a malignant tumour or a target protein) and generate molecules with high affinity toward this context. The experiments show that this method generalizes well and can propose molecules with high selectivity for unseen protein targets even in the absence of experimental data for such targets. In a case study on accelerated molecular discovery, the proposed generative model is integrated into a completely autonomous workflow that spans retrosynthesis models, synthesis protocol generation and the successful wet-lab synthesis on robotic hardware. The last chapter then proposes a multitask language model that abstracts regression as a conditional sequence modeling problem and thus unifies the previous work on molecular property prediction and conditional generation within the same model. This model not only excels on regression tasks despite relying on a classification loss, but can also be conditioned simultaneously on arbitrary molecular substructures and continuous target properties.
As demonstrated, this model outperforms specialized approaches in conditional molecular design and can decorate seed molecules, proteins or chemical reactions based on a desired property primer without the need of any optimization. This finds particular application in property-driven local exploration of the chemical space and paves the road toward foundation models in material design. Altogether, this thesis may contribute toward accelerated molecular discovery by providing methods to improve the quality of the average hypothesis that is considered for downstream chemical synthesis and wet-lab experimentation.
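The idea of abstracting regression as a conditional sequence modeling problem can be pictured with a small sketch: a continuous property value is written as a short sequence of digit tokens, so that predicting the value becomes token classification and fixing the value becomes conditioning. The token format (`digit@decimal-place`), the `<qed>` property tag, the `|` separator and the helper names below are hypothetical illustrations, not the tokenization used in the thesis.

```python
from typing import List


def encode_number(value: float, decimals: int = 2) -> List[str]:
    """Turn a non-negative float into digit tokens tagged with their decimal place."""
    if value < 0:
        raise ValueError("this sketch only handles non-negative values")
    if decimals < 1:
        raise ValueError("decimals must be at least 1")

    text = f"{value:.{decimals}f}"  # fixed-point string, e.g. "0.72"
    int_part, frac_part = text.split(".")

    tokens = []
    for i, digit in enumerate(int_part):
        tokens.append(f"{digit}@{len(int_part) - 1 - i}")  # places 10^k, ..., 10^0
    for i, digit in enumerate(frac_part):
        tokens.append(f"{digit}@{-(i + 1)}")  # places 10^-1, 10^-2, ...
    return tokens


def decode_number(tokens: List[str]) -> float:
    """Invert encode_number by summing digit * 10**place over all tokens."""
    total = 0.0
    for token in tokens:
        digit, place = token.split("@")
        total += int(digit) * 10.0 ** int(place)
    return total


def build_sequence(prop_tag: str, value: float, smiles: str) -> List[str]:
    """Concatenate property tokens and a character-level SMILES into one sequence."""
    return [prop_tag] + encode_number(value) + ["|"] + list(smiles)


# Hypothetical example: a property value of 0.72 attached to ethanol ("CCO").
sequence = build_sequence("<qed>", 0.72, "CCO")
print(sequence)
```

Under this framing, masking the digit tokens and asking a model to fill them in corresponds to property prediction, whereas masking parts of the molecule while keeping the digit tokens fixed corresponds to property-conditioned generation. Both can then be trained with the same classification loss over the token vocabulary.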
Public Data HLoHCR.zip
This folder includes all data necessary to understand and replicate the findings presented in the paper "Hebbian Learning of Hand-Centred Representations in a Hierarchical Neural Network Model of the Primate Visual System". This includes the code to implement the model we use, the workaround to run simulations, the simulation results and the statistical tools to analyse the results.